Loading Packages

Load the tidyverse package.

library(tidyverse)

Import chds6162_data

data <- read_csv("data/chds6162_data.csv")
## Rows: 1236 Columns: 23
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): drace
## dbl (22): id, pluralty, outcome, date, gestation, sex, wt, parity, race, age, ed, ht, wt.1, dage, ded...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

select

With the function select we can select variables (columns) from the larger data frame.

Use select to show just the gestation variable.

data_ges <- data %>%
  select(gestation)

We can also select a range of columns. select all the variables that belong to the father (they had a “d” in front of them) drace to dwt.

data %>%
  select(drace:dwt)
#What about just the id column and everything after the father information?
data %>%
  select(id, marital:last_col())

We can drop variables using the -var format. Drop the marital variable.

data %>%
  select(-(marital))

mutate

art by @allison_horst

We use mutate we make new variables or change existing ones.

Create a new variable with a specific value

Create a new variable called data_decade. Imagine that you will be merging this dataset from 61-62 to dataset from the 70’s. To make it easier, you will create this variable with the value “60s.”

data %>%
  mutate(data_decade = "60s")

Create a new variable based on other variables

Create a new variable called wt_k. This variable will give you information about mom’s weight pre-pregnancy(wt) in kilos (1 pound = .454 kilos).

data %>% 
  mutate(wt_k = wt*.454)
# too many decimals? let's round things

data %>% 
  mutate(wt_k = round((wt*.454),2))

case_when

art by @allison_horst

Change an existing variable

For some reason you want to have the real labels rather than the values. Let’s change the maritalvariable to its real values: 1 = married, 2 = legally separated, 3 = divorced, 4 = widowed, 5 = never married.

data %>% 
  mutate(marital = case_when(
    marital == 1 ~ "married",
    marital == 2 ~ "legally separated",
    marital == 3 ~ "divorced",
    marital == 4 ~ "widowed",
    marital == 5 ~ "never married"
  ))

filter

art by @allison_horst art by @allison_horst

We use filter to choose a subset of cases.

Use filter to keep only those whose parents have been never married (value 5 in the marital variable. Then, use select to show only the age variable to let us explore the mothers’ ages when they delivered.

data %>%
  filter(marital == 5) %>%
  select(age)

Use filter to keep only those who are not divorced (value 3).

data %>%
  filter(marital != 3)

Use %in% within the filter function to keep only those who are divorced (3), legally separated (2), or widowed (4).

data %>%
  filter(marital %in% c(2, 3,4))

Create a chain that keeps only those whose moms were in their 20s when they delivered and are college grads (val= 5). Use the age and ed variables.

data %>%
  filter(ed == 5, age %in% 20:30)

We can use <, >, <=, and >= for numeric data. == equal, != not equal

summarize or summarise

With summarize,as the name implies, you will get a summary of your dataset.

Get the mean days of gestation for this sample.

data %>%
  summarize(gestation_length_mean = round(mean(gestation, na.rm = TRUE),2))

We can have multiple arguments in each usage of summarize.

In addition to the mean gestation length you calculated earlier, calculate the minimum and maximum for the same variable.

data %>%
  summarize(mean_gestation_length = mean(gestation, na.rm = TRUE),
            min_gestation_length = min(gestation, na.rm = TRUE),
            max_gestation_length = max(gestation, na.rm = TRUE))

group_by

art by @allison_horst

group_byenables us to perform calculations on each of our groups.

Calculate the mean gestation length in days based on mother’s smoking habits (0=never, 1=smokes now, 2=until current pregnancy, 3=once did, not now) using group_by and summarize.

data %>% 
  group_by(smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE))
#Since we know that there are some NAs in the Smoke variable we can go ahead and drop those NAs before we summarize them
data %>% 
  drop_na(smoke) %>%
  group_by(smoke,ed) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE))
## `summarise()` has grouped output by 'smoke'. You can override using the `.groups` argument.

We can use group_by with multiple groups.

across

art by @allison_horst

acrossenables us to perform calculations on each of our selected columns

In the example above, we calculated the mean for gestation but what about if I wanted the mean for gestation, mom age (age) and dad age (dage)

#If we didn't know better, we would do this:
data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE),
            mean_m_age = mean(age,
                                    na.rm = TRUE),
            mean_d_age = mean(dage,
                                    na.rm = TRUE))
#But we know better:
data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(across(c(gestation,age,dage),mean,na.rm = TRUE))
?across

Create a new data frame

Sometimes you want to save the results of your work to a new data frame or replace the one you already have (not recommended).

to_graph <- data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(across(c(gestation,age,dage),mean,na.rm = TRUE))

relocate

art by @allison_horst

Sometimes you want to export the table as a csv but you don’t really like the order of the columns. Relocate allows you to move them around without you having to go back to previous codes. Let’s take the to_graph data frame and move the age column before the gestation column.

?relocate

to_graph %>%
  relocate (age, .before = gestation)
#What if I wanted gestation to be the last column?
to_graph %>%
  relocate (gestation, .after = last_col())
# ok now I like it and want a CSV of it. 
# I make sure to save it as a dataframe and then export it
csv_to_export <- to_graph %>%
  relocate (gestation, .after = last_col())

# export csv
write_csv(csv_to_export, "exports/example_to_export.csv")